Final Group Project — TFL Cycle Analysis¶

Group 2¶

Team Members:

  • Luisa Leite
  • Konstantinos Paschalis
  • Eshwar Nevedh
  • Ben Encel
  • Yiru Nie
  • Joy Lee
  • Kaiyang Dai

Project Synopsis:¶

This notebook explores the TFL Cycle Hire dataset , creating a regression model to understand the factors influencing bike rentals in London.

Workflow¶

  1. Data Acquisition & Cleaning

    • Load the raw TFL cycle hire dataset.
    • Handle missing values and inconsistent data types.
    • Create new engineered features (e.g., weekend, COVID flag, daylight time).
  2. Exploratory Data Analysis (EDA)

    • Summarize descriptive statistics of bike hires and weather conditions.
    • Visualize patterns across time (daily, weekly, monthly, seasonal).
    • Explore relationships between bike hires and external factors like temperature, precipitation, and COVID restrictions.
  3. Modeling Approach

    • Apply OLS regression to measure correlations and infer causation between weather, time, and external factors on bike hires.
    • Experiment with interaction terms and categorical variables (day of week, season, COVID).
    • Evaluate model fit using R², Adjusted R², RMSE, and residual diagnostics.
In [200]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates # Import for custom date formatting
import seaborn as sns
from datetime import datetime
from patsy import dmatrices
from skimpy import skim  # Used for generating summary statistics of DataFrames
import warnings
import xgboost as xgb
warnings.filterwarnings('ignore')

# Statistical analysis libraries
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.inspection import PartialDependenceDisplay
In [201]:
# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

1. Data Acquisition and Cleaning¶

Importing data¶
In [202]:
# =============================================================================
# London Bikes data
# =============================================================================

# We'll use data from http://www.tfl.gov.uk to analyse usage of the London Bike Sharing scheme. 
# This data has already been downloaded for you and exists in a CSV (Comma Separated Values) 
# file that you have to read in to Python.

# Read the CSV file
bike = pd.read_csv(r"C:\Users\SKEN\my-project\london_bikes.csv")  

# Display the first few rows of the dataframe
print("\nFirst few rows of the dataset:") 
print(bike.head())

# Create weekend indicator
bike['weekend'] = bike['wday'].isin(['Sat', 'Sun'])

print(f"Dataset shape: {bike.shape}")
print(f"Date range: {bike['date'].min()} to {bike['date'].max()}")
First few rows of the dataset:
                   date  bikes_hired  tempmax  tempmin  temp  feelslikemax  \
0  2010-07-30T00:00:00Z         6897     21.5     10.7  16.8          21.5   
1  2010-07-31T00:00:00Z         5564     23.0     15.0  18.9          23.0   
2  2010-08-01T00:00:00Z         4303     21.0     14.0  17.4          21.0   
3  2010-08-02T00:00:00Z         6642     20.4     14.3  17.1          20.4   
4  2010-08-03T00:00:00Z         7966     21.6     11.6  17.2          21.6   

   feelslikemin  feelslike   dew  humidity  precip  precipprob  precipcover  \
0          10.7       16.8  10.3      68.1   0.000           0         0.00   
1          15.0       18.9  14.3      76.3   2.029         100         8.33   
2          14.0       17.4  11.5      69.1   0.000           0         0.00   
3          14.3       17.1  11.6      71.1   3.635         100         8.33   
4          11.6       17.2  11.4      71.4   0.153         100         8.33   

  preciptype  snow  snowdepth  windgust  windspeed  winddir  sealevelpressure  \
0        NaN   0.0        0.0       NaN       19.2    252.2            1014.0   
1       rain   0.0        0.0       NaN       19.7    246.0            1010.9   
2        NaN   0.0        0.0       NaN       13.4    260.6            1012.7   
3       rain   0.0        0.0       NaN       16.2    327.6            1016.3   
4       rain   0.0        0.0       NaN       17.6    252.7            1015.3   

   cloudcover  visibility  solarradiation  solarenergy  uvindex  severerisk  \
0        56.1        22.9           277.5         23.9        9         NaN   
1        59.2        19.7           228.0         19.6        7         NaN   
2        64.7        25.2           257.8         22.3        8         NaN   
3        75.8        24.5           265.9         23.0        7         NaN   
4        50.9        21.8           285.3         24.7        8         NaN   

                sunrise                sunset  moonphase  \
0  2010-07-30T05:17:39Z  2010-07-30T20:50:39Z       0.65   
1  2010-07-31T05:19:09Z  2010-07-31T20:49:03Z       0.68   
2  2010-08-01T05:20:39Z  2010-08-01T20:47:25Z       0.71   
3  2010-08-02T05:22:11Z  2010-08-02T20:45:45Z       0.75   
4  2010-08-03T05:23:42Z  2010-08-03T20:44:03Z       0.75   

               conditions  \
0        Partially cloudy   
1  Rain, Partially cloudy   
2        Partially cloudy   
3  Rain, Partially cloudy   
4  Rain, Partially cloudy   

                                                  description  \
0                           Partly cloudy throughout the day.   
1                 Partly cloudy throughout the day with rain.   
2                           Partly cloudy throughout the day.   
3                 Partly cloudy throughout the day with rain.   
4  Becoming cloudy in the afternoon with rain clearing later.   

                icon month_name  month day_of_week wday  week  year  weekend  \
0  partly-cloudy-day        Jul      7         Fri  Fri    30  2010    False   
1               rain        Jul      7         Sat  Sat    30  2010     True   
2  partly-cloudy-day        Aug      8         Sun  Sun    30  2010     True   
3               rain        Aug      8         Mon  Mon    31  2010    False   
4               rain        Aug      8         Tue  Tue    31  2010    False   

  season_name    set  
0      Summer  train  
1      Summer  train  
2      Summer  train  
3      Summer  train  
4      Summer  train  
Dataset shape: (5268, 41)
Date range: 2010-07-30T00:00:00Z to 2024-12-30T00:00:00Z
Cleaning Data¶
In [203]:
# =============================================================================
# Cleaning our data
# =============================================================================


# Fix dates using pandas, and generate new variables for year, month, month_name, day, and day_of_week

# Convert the 'date' column from text to actual date format so Python knows it's dates
bike['date'] = pd.to_datetime(bike['date'])

# Create a list with the correct order of month abbreviations
# This ensures months appear in calendar order instead of alphabetical

bike = bike.assign(
    year=bike['date'].dt.year,
    month=bike['date'].dt.month,
    month_name=bike['date'].dt.month_name()
)
In [204]:
# Dropping the empty value outliers due to system maintenance in later 2022

bike = bike.drop(bike[bike["date"].isin(["2022-09-09", "2022-09-10", "2022-09-11"])].index)

# Dropping the severe risk column due to it being all nulls
bike = bike.drop(columns="severerisk",axis=1)

# Filling the precip type NaN values with Nil to capture the precipitation effect if needed
bike["preciptype"] = bike["preciptype"].fillna("Nil")

# Checking for other nulls across the columns
bike[bike.isna().any(axis=1)]
Out[204]:
date bikes_hired tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity precip precipprob precipcover preciptype snow snowdepth windgust windspeed winddir sealevelpressure cloudcover visibility solarradiation solarenergy uvindex sunrise sunset moonphase conditions description icon month_name month day_of_week wday week year weekend season_name set
0 2010-07-30 00:00:00+00:00 6897 21.5 10.7 16.8 21.5 10.7 16.8 10.3 68.1 0.000 0 0.00 Nil 0.0 0.0 NaN 19.2 252.2 1014.0 56.1 22.9 277.5 23.9 9 2010-07-30T05:17:39Z 2010-07-30T20:50:39Z 0.65 Partially cloudy Partly cloudy throughout the day. partly-cloudy-day July 7 Fri Fri 30 2010 False Summer train
1 2010-07-31 00:00:00+00:00 5564 23.0 15.0 18.9 23.0 15.0 18.9 14.3 76.3 2.029 100 8.33 rain 0.0 0.0 NaN 19.7 246.0 1010.9 59.2 19.7 228.0 19.6 7 2010-07-31T05:19:09Z 2010-07-31T20:49:03Z 0.68 Rain, Partially cloudy Partly cloudy throughout the day with rain. rain July 7 Sat Sat 30 2010 True Summer train
2 2010-08-01 00:00:00+00:00 4303 21.0 14.0 17.4 21.0 14.0 17.4 11.5 69.1 0.000 0 0.00 Nil 0.0 0.0 NaN 13.4 260.6 1012.7 64.7 25.2 257.8 22.3 8 2010-08-01T05:20:39Z 2010-08-01T20:47:25Z 0.71 Partially cloudy Partly cloudy throughout the day. partly-cloudy-day August 8 Sun Sun 30 2010 True Summer train
3 2010-08-02 00:00:00+00:00 6642 20.4 14.3 17.1 20.4 14.3 17.1 11.6 71.1 3.635 100 8.33 rain 0.0 0.0 NaN 16.2 327.6 1016.3 75.8 24.5 265.9 23.0 7 2010-08-02T05:22:11Z 2010-08-02T20:45:45Z 0.75 Rain, Partially cloudy Partly cloudy throughout the day with rain. rain August 8 Mon Mon 31 2010 False Summer train
4 2010-08-03 00:00:00+00:00 7966 21.6 11.6 17.2 21.6 11.6 17.2 11.4 71.4 0.153 100 8.33 rain 0.0 0.0 NaN 17.6 252.7 1015.3 50.9 21.8 285.3 24.7 8 2010-08-03T05:23:42Z 2010-08-03T20:44:03Z 0.75 Rain, Partially cloudy Becoming cloudy in the afternoon with rain clearing later. rain August 8 Tue Tue 31 2010 False Summer train
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1580 2014-11-26 00:00:00+00:00 23096 8.9 7.7 8.3 8.9 5.2 7.3 8.1 98.4 9.630 100 16.67 rain 0.0 0.0 NaN 15.0 91.7 1011.3 96.3 3.2 18.2 1.5 1 2014-11-26T07:35:03Z 2014-11-26T15:55:30Z 0.14 Rain, Overcast Cloudy skies throughout the day with a chance of rain throughout the day. rain November 11 Wed Wed 48 2014 False Autumn train
1581 2014-11-27 00:00:00+00:00 26765 10.6 8.8 9.8 10.6 7.4 8.7 9.5 98.2 1.637 100 12.50 rain 0.0 0.0 NaN 15.1 128.4 1003.5 84.1 6.0 36.2 3.3 2 2014-11-27T07:36:35Z 2014-11-27T15:54:38Z 0.18 Rain, Partially cloudy Partly cloudy throughout the day with a chance of rain throughout the day. rain November 11 Thu Thu 48 2014 False Autumn train
1582 2014-11-28 00:00:00+00:00 26782 11.3 9.8 10.5 11.3 7.8 10.3 9.7 94.9 0.129 100 4.17 rain 0.0 0.0 NaN 24.2 85.7 1004.1 83.3 6.8 51.0 4.3 2 2014-11-28T07:38:06Z 2014-11-28T15:53:48Z 0.21 Rain, Partially cloudy Partly cloudy throughout the day with morning rain. rain November 11 Fri Fri 48 2014 False Autumn train
1583 2014-11-29 00:00:00+00:00 24337 11.5 6.4 9.5 11.5 4.7 8.5 8.1 91.7 0.038 100 4.17 rain 0.0 0.0 NaN 19.8 106.6 1010.7 74.3 7.1 45.4 3.9 2 2014-11-29T07:39:35Z 2014-11-29T15:53:02Z 0.25 Rain, Partially cloudy Partly cloudy throughout the day with early morning rain. rain November 11 Sat Sat 48 2014 True Autumn train
1584 2014-11-30 00:00:00+00:00 19998 10.6 4.4 8.1 10.6 2.2 6.4 7.4 95.6 0.196 100 8.33 rain 0.0 0.0 NaN 24.0 331.2 1013.2 81.6 6.5 30.5 2.5 2 2014-11-30T07:41:03Z 2014-11-30T15:52:19Z 0.28 Rain, Partially cloudy Partly cloudy throughout the day with rain. rain November 11 Sun Sun 48 2014 True Autumn train

1126 rows × 40 columns

Times, Days and Seasons¶
In [205]:
# Define a list of month names
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

# Define a list of day names
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Convert the 'day_of_week' column to a categorical type with the specified order
# Categorical = tells pandas this column has a specific order, not just random text
bike['day_of_week'] = pd.Categorical(bike['day_of_week'], 
                                     categories=day_order, 
                                     ordered=True)


# Convert the 'month_name' column to a categorical type with the specified order
bike['month_name'] = pd.Categorical(bike['month_name'], 
                                    categories=month_order, 
                                    ordered=True)

# Generate new variable season_name to turn seasons from numbers to Winter, Spring, etc
# This function takes a month name and returns which season it belongs to
def get_season(month_name):
    if month_name in ['December', 'January', 'February']:  # Winter months
        return 'Winter'
    elif month_name in ['March', 'April', 'May']:  # Spring months
        return 'Spring'
    elif month_name in ['June', 'July', 'August']:  # Summer months
        return 'Summer'
    else:  # Autumn months (Sep, Oct, Nov)
        return 'Autumn'

# Apply the season function to each row to create a new 'season_name' column
bike['season_name'] = bike['month_name'].apply(get_season)

# Make season_name an ordered categorical so seasons appear in logical order
bike['season_name'] = pd.Categorical(bike['season_name'], 
                                    categories=['Winter', 'Spring', 'Summer', 'Autumn'], 
                                    ordered=True)

bike.head()
Out[205]:
date bikes_hired tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity precip precipprob precipcover preciptype snow snowdepth windgust windspeed winddir sealevelpressure cloudcover visibility solarradiation solarenergy uvindex sunrise sunset moonphase conditions description icon month_name month day_of_week wday week year weekend season_name set
0 2010-07-30 00:00:00+00:00 6897 21.5 10.7 16.8 21.5 10.7 16.8 10.3 68.1 0.000 0 0.00 Nil 0.0 0.0 NaN 19.2 252.2 1014.0 56.1 22.9 277.5 23.9 9 2010-07-30T05:17:39Z 2010-07-30T20:50:39Z 0.65 Partially cloudy Partly cloudy throughout the day. partly-cloudy-day July 7 Fri Fri 30 2010 False Summer train
1 2010-07-31 00:00:00+00:00 5564 23.0 15.0 18.9 23.0 15.0 18.9 14.3 76.3 2.029 100 8.33 rain 0.0 0.0 NaN 19.7 246.0 1010.9 59.2 19.7 228.0 19.6 7 2010-07-31T05:19:09Z 2010-07-31T20:49:03Z 0.68 Rain, Partially cloudy Partly cloudy throughout the day with rain. rain July 7 Sat Sat 30 2010 True Summer train
2 2010-08-01 00:00:00+00:00 4303 21.0 14.0 17.4 21.0 14.0 17.4 11.5 69.1 0.000 0 0.00 Nil 0.0 0.0 NaN 13.4 260.6 1012.7 64.7 25.2 257.8 22.3 8 2010-08-01T05:20:39Z 2010-08-01T20:47:25Z 0.71 Partially cloudy Partly cloudy throughout the day. partly-cloudy-day August 8 Sun Sun 30 2010 True Summer train
3 2010-08-02 00:00:00+00:00 6642 20.4 14.3 17.1 20.4 14.3 17.1 11.6 71.1 3.635 100 8.33 rain 0.0 0.0 NaN 16.2 327.6 1016.3 75.8 24.5 265.9 23.0 7 2010-08-02T05:22:11Z 2010-08-02T20:45:45Z 0.75 Rain, Partially cloudy Partly cloudy throughout the day with rain. rain August 8 Mon Mon 31 2010 False Summer train
4 2010-08-03 00:00:00+00:00 7966 21.6 11.6 17.2 21.6 11.6 17.2 11.4 71.4 0.153 100 8.33 rain 0.0 0.0 NaN 17.6 252.7 1015.3 50.9 21.8 285.3 24.7 8 2010-08-03T05:23:42Z 2010-08-03T20:44:03Z 0.75 Rain, Partially cloudy Becoming cloudy in the afternoon with rain clearing later. rain August 8 Tue Tue 31 2010 False Summer train

Creating new variables¶

In order to improve our models, we create new variables to discover further relationships.¶
In [206]:
# Adding a new daylight_time column
bike['sunset_dt'] = pd.to_datetime(bike['sunset'])
bike['sunrise_dt'] = pd.to_datetime(bike['sunrise'])

# Daylight_time is calculated as the sunset time minus the sunrise time
bike['daylight_time'] = bike['sunset_dt'] - bike['sunrise_dt']
bike['daylight_time'] = bike['daylight_time'].astype('int64') / 10000000

# Adding a binary variable COVID for the time period that was affected, to capture variability due to Covid.
bike['COVID'] = np.where(
    bike['date'].between('2020-03-16', '2022-12-31'),
    1,
    0
)

# Creating a variable precipitation squared to capture the polynomial effect on the bikes hired
bike['precip2'] = bike['precip'] ** 2

2. Exploratory Data Analysis¶

We explore the dataset, looking at the different outliers and distribution of the dataset, along with different visualizations between variables.¶
In [207]:
# Inspecting for the shape and overall summary statistics for bikes and bikes_hired
print(bike.info())
print(bike.head())

print("Overall summary statistics for bikes_hired:")
print(bike['bikes_hired'].describe())

skim(bike)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 45 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   date              5268 non-null   datetime64[ns, UTC]
 1   bikes_hired       5268 non-null   int64              
 2   tempmax           5268 non-null   float64            
 3   tempmin           5268 non-null   float64            
 4   temp              5268 non-null   float64            
 5   feelslikemax      5268 non-null   float64            
 6   feelslikemin      5268 non-null   float64            
 7   feelslike         5268 non-null   float64            
 8   dew               5268 non-null   float64            
 9   humidity          5268 non-null   float64            
 10  precip            5268 non-null   float64            
 11  precipprob        5268 non-null   int64              
 12  precipcover       5268 non-null   float64            
 13  preciptype        5268 non-null   object             
 14  snow              5268 non-null   float64            
 15  snowdepth         5268 non-null   float64            
 16  windgust          4142 non-null   float64            
 17  windspeed         5268 non-null   float64            
 18  winddir           5268 non-null   float64            
 19  sealevelpressure  5268 non-null   float64            
 20  cloudcover        5268 non-null   float64            
 21  visibility        5268 non-null   float64            
 22  solarradiation    5268 non-null   float64            
 23  solarenergy       5268 non-null   float64            
 24  uvindex           5268 non-null   int64              
 25  sunrise           5268 non-null   object             
 26  sunset            5268 non-null   object             
 27  moonphase         5268 non-null   float64            
 28  conditions        5268 non-null   object             
 29  description       5268 non-null   object             
 30  icon              5268 non-null   object             
 31  month_name        5268 non-null   category           
 32  month             5268 non-null   int32              
 33  day_of_week       5268 non-null   category           
 34  wday              5268 non-null   object             
 35  week              5268 non-null   int64              
 36  year              5268 non-null   int32              
 37  weekend           5268 non-null   bool               
 38  season_name       5268 non-null   category           
 39  set               5268 non-null   object             
 40  sunset_dt         5268 non-null   datetime64[ns, UTC]
 41  sunrise_dt        5268 non-null   datetime64[ns, UTC]
 42  daylight_time     5268 non-null   float64            
 43  COVID             5268 non-null   int64              
 44  precip2           5268 non-null   float64            
dtypes: bool(1), category(3), datetime64[ns, UTC](3), float64(23), int32(2), int64(5), object(8)
memory usage: 1.6+ MB
None
                       date  bikes_hired  tempmax  tempmin  temp  \
0 2010-07-30 00:00:00+00:00         6897     21.5     10.7  16.8   
1 2010-07-31 00:00:00+00:00         5564     23.0     15.0  18.9   
2 2010-08-01 00:00:00+00:00         4303     21.0     14.0  17.4   
3 2010-08-02 00:00:00+00:00         6642     20.4     14.3  17.1   
4 2010-08-03 00:00:00+00:00         7966     21.6     11.6  17.2   

   feelslikemax  feelslikemin  feelslike   dew  humidity  precip  precipprob  \
0          21.5          10.7       16.8  10.3      68.1   0.000           0   
1          23.0          15.0       18.9  14.3      76.3   2.029         100   
2          21.0          14.0       17.4  11.5      69.1   0.000           0   
3          20.4          14.3       17.1  11.6      71.1   3.635         100   
4          21.6          11.6       17.2  11.4      71.4   0.153         100   

   precipcover preciptype  snow  snowdepth  windgust  windspeed  winddir  \
0         0.00        Nil   0.0        0.0       NaN       19.2    252.2   
1         8.33       rain   0.0        0.0       NaN       19.7    246.0   
2         0.00        Nil   0.0        0.0       NaN       13.4    260.6   
3         8.33       rain   0.0        0.0       NaN       16.2    327.6   
4         8.33       rain   0.0        0.0       NaN       17.6    252.7   

   sealevelpressure  cloudcover  visibility  solarradiation  solarenergy  \
0            1014.0        56.1        22.9           277.5         23.9   
1            1010.9        59.2        19.7           228.0         19.6   
2            1012.7        64.7        25.2           257.8         22.3   
3            1016.3        75.8        24.5           265.9         23.0   
4            1015.3        50.9        21.8           285.3         24.7   

   uvindex               sunrise                sunset  moonphase  \
0        9  2010-07-30T05:17:39Z  2010-07-30T20:50:39Z       0.65   
1        7  2010-07-31T05:19:09Z  2010-07-31T20:49:03Z       0.68   
2        8  2010-08-01T05:20:39Z  2010-08-01T20:47:25Z       0.71   
3        7  2010-08-02T05:22:11Z  2010-08-02T20:45:45Z       0.75   
4        8  2010-08-03T05:23:42Z  2010-08-03T20:44:03Z       0.75   

               conditions  \
0        Partially cloudy   
1  Rain, Partially cloudy   
2        Partially cloudy   
3  Rain, Partially cloudy   
4  Rain, Partially cloudy   

                                                  description  \
0                           Partly cloudy throughout the day.   
1                 Partly cloudy throughout the day with rain.   
2                           Partly cloudy throughout the day.   
3                 Partly cloudy throughout the day with rain.   
4  Becoming cloudy in the afternoon with rain clearing later.   

                icon month_name  month day_of_week wday  week  year  weekend  \
0  partly-cloudy-day       July      7         Fri  Fri    30  2010    False   
1               rain       July      7         Sat  Sat    30  2010     True   
2  partly-cloudy-day     August      8         Sun  Sun    30  2010     True   
3               rain     August      8         Mon  Mon    31  2010    False   
4               rain     August      8         Tue  Tue    31  2010    False   

  season_name    set                 sunset_dt                sunrise_dt  \
0      Summer  train 2010-07-30 20:50:39+00:00 2010-07-30 05:17:39+00:00   
1      Summer  train 2010-07-31 20:49:03+00:00 2010-07-31 05:19:09+00:00   
2      Summer  train 2010-08-01 20:47:25+00:00 2010-08-01 05:20:39+00:00   
3      Summer  train 2010-08-02 20:45:45+00:00 2010-08-02 05:22:11+00:00   
4      Summer  train 2010-08-03 20:44:03+00:00 2010-08-03 05:23:42+00:00   

   daylight_time  COVID    precip2  
0      5598000.0      0   0.000000  
1      5579400.0      0   4.116841  
2      5560600.0      0   0.000000  
3      5541400.0      0  13.213225  
4      5522100.0      0   0.023409  
Overall summary statistics for bikes_hired:
count     5268.000000
mean     26327.596811
std       9496.916456
min          0.000000
25%      19746.750000
50%      26070.000000
75%      32745.000000
max      73094.000000
Name: bikes_hired, dtype: float64
╭──────────────────────────────────────────────── skimpy summary ─────────────────────────────────────────────────╮
│          Data Summary                Data Types               Categories                                        │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓                                │
│ ┃ Dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃                                │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩                                │
│ │ Number of rows    │ 5268   │ │ float64     │ 23    │ │ month_name            │                                │
│ │ Number of columns │ 45     │ │ string      │ 8     │ │ day_of_week           │                                │
│ └───────────────────┴────────┘ │ int64       │ 7     │ │ season_name           │                                │
│                                │ datetime64  │ 3     │ └───────────────────────┘                                │
│                                │ category    │ 3     │                                                          │
│                                │ bool        │ 1     │                                                          │
│                                └─────────────┴───────┘                                                          │
│                                                     number                                                      │
│ ┏━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓  │
│ ┃ column   ┃ NA   ┃ NA %      ┃ mean    ┃ sd      ┃ p0      ┃ p25     ┃ p50     ┃ p75     ┃ p100    ┃ hist   ┃  │
│ ┡━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩  │
│ │ bikes_hi │    0 │         0 │   26330 │    9497 │       0 │   19750 │   26070 │   32740 │   73090 │  ▁▇▇▃  │  │
│ │ red      │      │           │         │         │         │         │         │         │         │        │  │
│ │ tempmax  │    0 │         0 │   14.73 │   6.364 │    -2.6 │    10.1 │    14.4 │    19.5 │    36.1 │ ▁▅▇▇▂  │  │
│ │ tempmin  │    0 │         0 │   7.836 │   5.039 │   -11.1 │     4.1 │       8 │    11.6 │    22.1 │  ▁▆▇▆▁ │  │
│ │ temp     │    0 │         0 │   11.23 │   5.481 │    -5.9 │     7.2 │    11.1 │    15.5 │    28.3 │  ▃▇▇▃  │  │
│ │ feelslik │    0 │         0 │   14.09 │   7.239 │    -7.2 │    10.1 │    14.4 │    19.5 │    38.2 │ ▁▅▇▇▂  │  │
│ │ emax     │      │           │         │         │         │         │         │         │         │        │  │
│ │ feelslik │    0 │         0 │   6.185 │   6.376 │   -14.6 │   1.175 │       6 │    11.6 │    22.1 │  ▂▇▆▇▁ │  │
│ │ emin     │      │           │         │         │         │         │         │         │         │        │  │
│ │ feelslik │    0 │         0 │   10.09 │   6.668 │    -9.1 │     4.9 │    10.6 │    15.5 │    28.9 │ ▁▅▇▇▅  │  │
│ │ e        │      │           │         │         │         │         │         │         │         │        │  │
│ │ dew      │    0 │         0 │   7.413 │   4.672 │      -8 │     4.1 │     7.6 │      11 │      19 │  ▂▆▇▆▁ │  │
│ │ humidity │    0 │         0 │   79.28 │    10.4 │    39.6 │    72.2 │    80.3 │    87.3 │    99.9 │  ▁▃▇▇▃ │  │
│ │ precip   │    0 │         0 │   1.641 │   4.733 │       0 │       0 │   0.185 │   1.337 │   168.2 │   ▇    │  │
│ │ precippr │    0 │         0 │   73.04 │   44.38 │       0 │       0 │     100 │     100 │     100 │ ▃    ▇ │  │
│ │ ob       │      │           │         │         │         │         │         │         │         │        │  │
│ │ precipco │    0 │         0 │   8.959 │    11.6 │       0 │       0 │    8.33 │    12.5 │     100 │   ▇▁   │  │
│ │ ver      │      │           │         │         │         │         │         │         │         │        │  │
│ │ snow     │    0 │         0 │ 0.01517 │  0.2454 │       0 │       0 │       0 │       0 │     8.4 │   ▇    │  │
│ │ snowdept │    0 │         0 │ 0.04677 │  0.5492 │       0 │       0 │       0 │       0 │    18.1 │   ▇    │  │
│ │ h        │      │           │         │         │         │         │         │         │         │        │  │
│ │ windgust │ 1126 │ 21.374335 │   42.08 │   14.84 │     9.4 │   30.73 │    40.7 │    51.8 │   120.9 │  ▃▇▅▁  │  │
│ │          │      │  61123766 │         │         │         │         │         │         │         │        │  │
│ │ windspee │    0 │         0 │   22.95 │   8.421 │     5.1 │   16.88 │    21.8 │    27.6 │    72.2 │  ▃▇▃▁  │  │
│ │ d        │      │           │         │         │         │         │         │         │         │        │  │
│ │ winddir  │    0 │         0 │     197 │   92.33 │     0.1 │   127.8 │   221.6 │     259 │   359.8 │ ▃▃▃▇▇▃ │  │
│ │ sealevel │    0 │         0 │    1015 │   10.52 │   966.6 │    1009 │    1016 │    1022 │    1048 │   ▃▇▅  │  │
│ │ pressure │      │           │         │         │         │         │         │         │         │        │  │
│ │ cloudcov │    0 │         0 │   60.86 │   22.45 │       0 │    46.6 │    62.7 │    77.8 │     100 │ ▁▂▅▇▇▅ │  │
│ │ er       │      │           │         │         │         │         │         │         │         │        │  │
│ │ visibili │    0 │         0 │   22.65 │   10.62 │     0.2 │    15.2 │      22 │    28.9 │    62.5 │ ▃▇▇▃▁  │  │
│ │ ty       │      │           │         │         │         │         │         │         │         │        │  │
│ │ solarrad │    0 │         0 │     112 │   84.89 │       0 │    40.2 │    91.5 │   169.8 │   358.6 │ ▇▅▃▃▁▁ │  │
│ │ iation   │      │           │         │         │         │         │         │         │         │        │  │
│ │ solarene │    0 │         0 │   9.663 │   7.336 │       0 │     3.5 │     7.9 │    14.6 │      31 │ ▇▅▃▃▁▁ │  │
│ │ rgy      │      │           │         │         │         │         │         │         │         │        │  │
│ │ uvindex  │    0 │         0 │    4.26 │   2.553 │       0 │       2 │       4 │       6 │      10 │ ▅▇▃▆▆▁ │  │
│ │ moonphas │    0 │         0 │  0.4831 │  0.2888 │       0 │    0.25 │     0.5 │    0.75 │    0.98 │ ▇▇▇▇▇▇ │  │
│ │ e        │      │           │         │         │         │         │         │         │         │        │  │
│ │ month    │    0 │         0 │   6.623 │   3.456 │       1 │       4 │       7 │      10 │      12 │ ▇▇▇▇▇▇ │  │
│ │ week     │    0 │         0 │   27.01 │   15.08 │       1 │      14 │      27 │      40 │      53 │ ▇▇▇▇▇▇ │  │
│ │ year     │    0 │         0 │    2017 │   4.169 │    2010 │    2014 │    2017 │    2021 │    2024 │ ▆▅▅▇▅▇ │  │
│ │ daylight │    0 │         0 │ 4405000 │ 1085000 │ 2803000 │ 3363000 │ 4410000 │ 5441000 │ 6006000 │ ▇▅▅▅▅▇ │  │
│ │ _time    │      │           │         │         │         │         │         │         │         │        │  │
│ │ COVID    │    0 │         0 │  0.1938 │  0.3953 │       0 │       0 │       0 │       0 │       1 │ ▇    ▂ │  │
│ │ precip2  │    0 │         0 │   25.09 │   515.2 │       0 │       0 │ 0.03422 │   1.786 │   28290 │   ▇    │  │
│ └──────────┴──────┴───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┘  │
│                                                    category                                                     │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column                           ┃ NA        ┃ NA %           ┃ ordered               ┃ unique             ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩  │
│ │ month_name                       │         0 │              0 │ True                  │                 12 │  │
│ │ day_of_week                      │         0 │              0 │ True                  │                  7 │  │
│ │ season_name                      │         0 │              0 │ True                  │                  4 │  │
│ └──────────────────────────────────┴───────────┴────────────────┴───────────────────────┴────────────────────┘  │
│                                                      bool                                                       │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓  │
│ ┃ column                     ┃ true              ┃ true rate                        ┃ hist                   ┃  │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩  │
│ │ weekend                    │              1506 │                             0.29 │         ▇    ▃         │  │
│ └────────────────────────────┴───────────────────┴──────────────────────────────────┴────────────────────────┘  │
│                                                    datetime                                                     │
│ ┏━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
│ ┃ column          ┃ NA   ┃ NA %    ┃ first                       ┃ last                       ┃ frequency    ┃  │
│ ┡━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
│ │ date            │    0 │       0 │         2010-07-30          │         2024-12-30         │ D            │  │
│ │ sunset_dt       │    0 │       0 │     2010-07-30 20:50:39     │    2024-12-30 16:00:26     │ None         │  │
│ │ sunrise_dt      │    0 │       0 │     2010-07-30 05:17:39     │    2024-12-30 08:06:26     │ None         │  │
│ └─────────────────┴──────┴─────────┴─────────────────────────────┴────────────────────────────┴──────────────┘  │
│                                                     string                                                      │
│ ┏━━━━━━━━━━━┳━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓  │
│ ┃           ┃    ┃      ┃           ┃           ┃           ┃           ┃ chars per ┃ words per ┃ total      ┃  │
│ ┃ column    ┃ NA ┃ NA % ┃ shortest  ┃ longest   ┃ min       ┃ max       ┃ row       ┃ row       ┃ words      ┃  │
│ ┡━━━━━━━━━━━╇━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩  │
│ │ preciptyp │  0 │    0 │ Nil       │ rain,snow │ Nil       │ snow      │         4 │         1 │       5268 │  │
│ │ e         │    │      │           │           │           │           │           │           │            │  │
│ │ sunrise   │  0 │    0 │ 2010-07-3 │ 2010-07-3 │ 2010-07-3 │ 2024-12-3 │        20 │         1 │       5268 │  │
│ │           │    │      │ 0T05:17:3 │ 0T05:17:3 │ 0T05:17:3 │ 0T08:06:2 │           │           │            │  │
│ │           │    │      │ 9Z        │ 9Z        │ 9Z        │ 6Z        │           │           │            │  │
│ │ sunset    │  0 │    0 │ 2010-07-3 │ 2010-07-3 │ 2010-07-3 │ 2024-12-3 │        20 │         1 │       5268 │  │
│ │           │    │      │ 0T20:50:3 │ 0T20:50:3 │ 0T20:50:3 │ 0T16:00:2 │           │           │            │  │
│ │           │    │      │ 9Z        │ 9Z        │ 9Z        │ 6Z        │           │           │            │  │
│ │ condition │  0 │    0 │ Rain      │ Snow,     │ Clear     │ Snow,     │      19.3 │       2.6 │      13832 │  │
│ │ s         │    │      │           │ Rain,     │           │ Rain,     │           │           │            │  │
│ │           │    │      │           │ Partially │           │ Partially │           │           │            │  │
│ │           │    │      │           │ cloudy    │           │ cloudy    │           │           │            │  │
│ │ descripti │  0 │    0 │ Clearing  │ Clear     │ Becoming  │ Partly    │      50.5 │       8.2 │      43311 │  │
│ │ on        │    │      │ in the    │ condition │ cloudy in │ cloudy    │           │           │            │  │
│ │           │    │      │ afternoon │ s         │ the       │ throughou │           │           │            │  │
│ │           │    │      │ .         │ throughou │ afternoon │ t the     │           │           │            │  │
│ │           │    │      │           │ t the day │ with a    │ day.      │           │           │            │  │
│ │           │    │      │           │ with a    │ chance of │           │           │           │            │  │
│ │           │    │      │           │ chance of │ rain or   │           │           │           │            │  │
│ │           │    │      │           │ rain or   │ snow      │           │           │           │            │  │
│ │           │    │      │           │ snow      │ throughou │           │           │           │            │  │
│ │           │    │      │           │ throughou │ t the     │           │           │           │            │  │
│ │           │    │      │           │ t the     │ day.      │           │           │           │            │  │
│ │           │    │      │           │ day.      │           │           │           │           │            │  │
│ │ icon      │  0 │    0 │ rain      │ partly-cl │ clear-day │ wind      │         7 │         1 │       5268 │  │
│ │           │    │      │           │ oudy-day  │           │           │           │           │            │  │
│ │ wday      │  0 │    0 │ Fri       │ Fri       │ Fri       │ Wed       │         3 │         1 │       5268 │  │
│ │ set       │  0 │    0 │ train     │ train     │ train     │ train     │         5 │         1 │       5268 │  │
│ └───────────┴────┴──────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────────┘  │
╰────────────────────────────────────────────────────── End ──────────────────────────────────────────────────────╯
In [208]:
#Summary statistics by different variables and relationships

print("\nSummary statistics by year:")
print(bike.groupby('year')['bikes_hired'].describe())

print("\nSummary statistics by day of week:")
print(bike.groupby('day_of_week')['bikes_hired'].describe())
Summary statistics by year:
      count          mean           std     min       25%      50%       75%  \
year                                                                           
2010  155.0  14069.761290   5616.909424  2764.0   9297.00  14010.0  18677.50   
2011  365.0  19568.353425   5497.468615  4555.0  16260.00  20264.0  23708.00   
2012  366.0  26008.969945   9429.395314  3531.0  19282.00  26178.5  32532.75   
2013  365.0  22042.353425   7276.488364  3728.0  17555.00  22021.0  27371.00   
2014  365.0  27462.731507   9065.086440  4327.0  20532.00  27676.0  34437.00   
2015  365.0  27046.134247   8547.890451  5779.0  22056.00  26618.0  32857.00   
2016  366.0  28152.013661   8850.957827  4894.0  22402.00  27881.5  35130.00   
2017  365.0  28619.298630   8376.552423  5143.0  24064.00  29490.0  34459.00   
2018  365.0  28952.164384  10174.346177  5859.0  21759.00  29190.0  37677.00   
2019  365.0  28561.520548   8087.904111  5649.0  24198.00  28927.0  34494.00   
2020  366.0  28508.653005  11587.302615  4872.0  20147.25  27508.0  36487.25   
2021  365.0  29976.065753  10990.961868  6253.0  21703.00  30988.0  38207.00   
2022  365.0  31522.936986  10337.169182     0.0  25096.00  31397.0  40020.00   
2023  365.0  23373.063014   6029.498261  6721.0  19949.00  23692.0  27805.00   
2024  365.0  23987.378082   6143.741429  7479.0  20436.00  24806.0  28656.00   

          max  
year           
2010  27964.0  
2011  29417.0  
2012  47102.0  
2013  35580.0  
2014  49025.0  
2015  73094.0  
2016  46625.0  
2017  46035.0  
2018  46084.0  
2019  44668.0  
2020  70170.0  
2021  56922.0  
2022  66972.0  
2023  35319.0  
2024  35190.0  

Summary statistics by day of week:
             count          mean           std     min       25%      50%  \
day_of_week                                                                 
Mon          753.0  26050.938911   8277.117203  3971.0  20644.00  25748.0   
Tue          752.0  28178.922872   8618.748160  3763.0  22462.75  27854.0   
Wed          752.0  28372.493351   8523.335169  4327.0  22752.75  27956.0   
Thu          752.0  28275.652926   8788.769687  5649.0  22657.75  27534.0   
Fri          753.0  26983.482072   8544.883666  5402.0  21405.00  26532.0   
Sat          753.0  24222.066401  11082.313167     0.0  16009.00  22475.0   
Sun          753.0  22217.382470  10498.340099     0.0  14080.00  20971.0   

                  75%      max  
day_of_week                     
Mon          31469.00  67034.0  
Tue          34435.50  65326.0  
Wed          34228.50  54360.0  
Thu          34432.75  73094.0  
Fri          32857.00  66972.0  
Sat          30847.00  70170.0  
Sun          29669.00  63116.0  
In [209]:
print("\nSummary statistics by month:")
print(bike.groupby('month_name')['bikes_hired'].describe())

print("\nSummary statistics by season:")
print(bike.groupby('season_name')['bikes_hired'].describe())
Summary statistics by month:
            count          mean          std      min       25%      50%  \
month_name                                                                 
January     434.0  18652.470046  5992.562937   3728.0  13910.25  18964.0   
February    396.0  20312.861111  6372.171831   3531.0  15978.25  20646.5   
March       434.0  22710.850230  7589.303757   5062.0  17586.75  23052.5   
April       420.0  25999.145238  7614.414499   4872.0  21016.50  25932.5   
May         434.0  30424.000000  8443.833748  10684.0  24531.25  29895.0   
June        420.0  33403.642857  8593.874083   6061.0  27701.50  33057.5   
July        436.0  34687.383028  8346.706529   5564.0  29070.50  35781.5   
August      465.0  31372.529032  9596.474662   4303.0  25638.00  32439.0   
September   450.0  30364.573333  8112.695324      0.0  24866.75  31168.5   
October     465.0  27463.823656  6768.732139   7068.0  23281.00  28015.0   
November    450.0  23271.486667  6406.332308   6030.0  18766.50  23685.5   
December    464.0  17082.303879  6908.340305   2764.0  11396.25  16652.0   

                 75%      max  
month_name                     
January     23331.00  38042.0  
February    24553.50  52485.0  
March       27191.75  56568.0  
April       30838.00  49025.0  
May         36045.00  70170.0  
June        39235.50  65326.0  
July        41336.25  73094.0  
August      38252.00  66972.0  
September   36088.50  51764.0  
October     32568.00  47877.0  
November    27944.00  44677.0  
December    22908.00  39145.0  

Summary statistics by season:
              count          mean          std     min       25%      50%  \
season_name                                                                 
Winter       1294.0  18597.568779  6576.148434  2764.0  13696.75  18803.0   
Spring       1288.0  26382.116460  8505.519735  4872.0  20758.75  26095.0   
Summer       1321.0  33112.380772  8982.562793  4303.0  27571.00  33705.0   
Autumn       1365.0  27038.025641  7691.127484     0.0  21886.00  27205.0   

                  75%      max  
season_name                     
Winter       23515.75  52485.0  
Spring       31343.25  70170.0  
Summer       39582.00  73094.0  
Autumn       32569.00  51764.0  
In [210]:
print("\nSummary statistics by weather type:")
print(bike.groupby('icon')['bikes_hired'].describe())

print("\nSummary statistics by precipitation:")
print(bike.groupby('preciptype')['bikes_hired'].describe())
Summary statistics by weather type:
                    count          mean          std      min       25%  \
icon                                                                      
clear-day           195.0  35032.994872  9594.133137  10558.0  28971.00   
cloudy               96.0  22891.750000  7984.384212   7262.0  17242.75   
partly-cloudy-day  1127.0  31646.354037  8937.950401   4303.0  25909.00   
rain               3797.0  24581.882802  8735.699590      0.0  18553.00   
snow                 51.0  12412.039216  6271.264600   2764.0   7071.00   
wind                  2.0  14437.000000   708.520995  13936.0  14186.50   

                       50%      75%      max  
icon                                          
clear-day          35196.0  42212.5  63116.0  
cloudy             22414.5  28001.0  41447.0  
partly-cloudy-day  31834.0  38515.0  70170.0  
rain               24370.0  30285.0  73094.0  
snow               12300.0  16780.5  28765.0  
wind               14437.0  14687.5  14938.0  

Summary statistics by precipitation:
             count          mean          std     min       25%      50%  \
preciptype                                                                 
Nil         1411.0  31617.082211  9247.522684  4303.0  25546.50  31605.0   
rain        3556.0  25000.573678  8722.382286     0.0  18969.25  24898.5   
rain,snow    287.0  17300.104530  6669.224668  2764.0  12280.00  18061.0   
snow          14.0  15350.500000  5938.452592  5621.0  11722.50  15772.0   

                 75%      max  
preciptype                     
Nil         38767.50  70170.0  
rain        30784.75  73094.0  
rain,snow   22549.00  35267.0  
snow        20800.50  22460.0  
In [211]:
# Time series plot of bikes rented
# While summary statistics allow us to quickly discover key metrics that represent 
# the data, they are often not sufficient by themselves. Often, it is useful to 
# represent the data graphically (or visually) to uncover information that is 
# critical for decision making, such as trends, patterns and outliers.

# Filter data from January 1, 2014 onwards
bike_filtered = bike[bike['date'] >= pd.to_datetime('2014-01-01', utc=True)].copy()

# Create time series plot showcasing bikes hired across past years
plt.figure(figsize=(15, 6))
sns.scatterplot(data=bike_filtered, x='date', y='bikes_hired', alpha=0.4)
plt.title('Time Series Plot of Bikes Hired (2014 onwards)', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Bikes Hired')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [212]:
# Creating a histogram plot showcasing the frequency counts of bikes rented every day
sns.histplot(
    data=bike,           # Specify the DataFrame to use
    x='bikes_hired',     # Specify the column to plot
    bins=40,             # Divide the data into 30 bins (bars)
    alpha=0.7,           # Set transparency (0=invisible, 1=solid)
    color='blue',       # Set the fill color of the bars
    edgecolor='grey',    # Add grey lines around each bar for better definition
    kde=True             # Add a smooth density curve on top of the histogram
)

# Customize plot labels and title
plt.title('Histogram of Bikes Rented', fontsize=16)  # Main title
plt.xlabel('Bikes Hired', fontsize=12)               # X-axis label
plt.ylabel('Frequency', fontsize=12)                 # Y-axis label

# Ensure the layout is clean and labels don't overlap
plt.tight_layout()

# Display the plot
plt.show()
No description has been provided for this image
In [213]:
# FACETED HISTOGRAM BY SEASON
# Create separate histograms for each season to compare distributions

sns.set_style("whitegrid")  # Set consistent style
# Create a grid of histograms, one for each season
g = sns.displot(
    data=bike,           # The DataFrame to use
    x='bikes_hired',     # Variable to plot
    col='season_name',   # Create separate plots for each season
    col_wrap=2,          # Arrange plots in 2 columns
    kind='hist',         # Make histogram plots
    bins=20,             # Use 20 bins per histogram
    hue='season_name',       # Bar color
    edgecolor='black',   # Edge color
    alpha=0.7            # Transparency
)

# Customize the grid of plots
g.set_axis_labels('Bikes Hired', 'Frequency')  # Set labels for all plots
g.set_titles('{col_name}')  # Use season name as title for each subplot
g.fig.suptitle('Distribution of Bikes Hired by Season', y=1.03)  # Overall title

plt.show()
No description has been provided for this image
In [214]:
# Creating a scatterplot to compare distributions of bike hires across weather conditions with weekend-weekday differences

plt.figure(figsize=(14, 7))
sns.stripplot(
    data=bike_filtered,
    x="icon",
    y="bikes_hired",
    hue="weekend",
    jitter=True,
    dodge=True,
    alpha=0.6,
    palette="Set2"
)

plt.title("Bike Hires by Weather Condition (Scatter, Weekend vs Weekday)", fontsize=16, weight="bold")
plt.xlabel("Weather Condition (Icon)", fontsize=14)
plt.ylabel("Number of Bikes Hired", fontsize=14)
plt.legend(title="Weekend?", fontsize=12, title_fontsize=13)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [215]:
# Creating a boxplot to compare distributions of bike hires across months with weekend-weekday differences
plt.figure(figsize=(14, 7))
sns.boxplot(
    data=bike_filtered,
    x="month_name",
    y="bikes_hired",
    hue="weekend",
    palette="Set3"
)

plt.title("Bike Hires Distribution by Weather Condition (Weekend vs Weekday)", fontsize=16, weight="bold")
plt.xlabel("Month", fontsize=14)
plt.ylabel("Number of Bikes Hired", fontsize=14)
plt.legend(title="Weekend?", fontsize=12, title_fontsize=13)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
No description has been provided for this image
In [216]:
# Creating a boxplot to compare distributions of bike hires across precipitation type with weekend-weekday differences
plt.figure(figsize=(14, 7))
sns.boxplot(
    data=bike_filtered,
    x="preciptype",
    y="bikes_hired",
    hue="weekend",
    palette="Set2"
)

plt.title("Bike Hires Distribution by Weather Condition (Weekend vs Weekday)", fontsize=16, weight="bold")
plt.xlabel("Precipitation Type", fontsize=14)
plt.ylabel("Number of Bikes Hired", fontsize=14)
plt.legend(title="Weekend?", fontsize=12, title_fontsize=13)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()
No description has been provided for this image

3. Modelling Approach¶

Checking out correlations between variables and building an OLS regression model¶
In [217]:
# =============================================================================
# Model building
# =============================================================================


# Correlation - Scatterplot matrix
# Besides the number of bikes rented out on a given day, we have also downloaded 
# weather data for London such as mean_temp, humidity, pressure, precipitation, etc. 
# that measure weather conditions on a single day. It may be the case that more 
# bikes are rented out when it's warmer? Or how can we estimate the effect of rain on rentals?

# Your task is to build a regression model that helps you explain the number of rentals per day.

# Select  variables and create a correlation matrix
numerical_vars = ['weekend', 'cloudcover', 'humidity', 'precip', 'temp', 'feelslike', 'bikes_hired']
correlation_data = bike[numerical_vars]

# Create correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = correlation_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='crest', center=0,
            square=True, linewidths=0.5)
plt.title('Correlation Matrix of Numerical Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Create pairplot for scatterplot matrix
plt.figure(figsize=(10, 8))
sns.pairplot(correlation_data, 
             hue='weekend',
             diag_kind='kde')
plt.suptitle('Scatterplot Matrix of Numerical Variables', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image
<Figure size 1000x800 with 0 Axes>
No description has been provided for this image
In [218]:
# Select only numeric columns
numeric_df = bike_filtered.select_dtypes(include=['int64', 'float64'])

# Compute correlation matrix
corr = numeric_df.corr()

# Correlation of all variables with bikes_hired
bikes_corr = corr['bikes_hired'].sort_values(ascending=False)

print(bikes_corr)

plt.figure(figsize=(10, 8))
plt.imshow(corr, cmap='magma', interpolation='nearest')
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Matrix")
plt.show()
bikes_hired         1.000000
tempmax             0.663627
feelslikemax        0.654427
feelslike           0.619992
temp                0.616227
solarradiation      0.596642
solarenergy         0.596239
uvindex             0.587361
daylight_time       0.581021
feelslikemin        0.538198
tempmin             0.505682
dew                 0.457466
visibility          0.324185
sealevelpressure    0.270639
COVID               0.169397
week                0.112093
moonphase           0.000097
winddir            -0.056441
precip2            -0.065451
snow               -0.098805
snowdepth          -0.115190
precip             -0.228450
windspeed          -0.252994
windgust           -0.256263
precipprob         -0.330499
cloudcover         -0.345898
precipcover        -0.347468
humidity           -0.492140
Name: bikes_hired, dtype: float64
No description has been provided for this image
In [219]:
# Weekend or weekdays? Is there a difference?
print("\nT-test for difference between weekend and weekday bike rentals:")

# Split data into two groups
weekend_bikes = bike[bike['weekend'] == True]['bikes_hired']
weekday_bikes = bike[bike['weekend'] == False]['bikes_hired']

# Null hypothesis (H0): mean(weekend) == mean(weekday)
# Alternative hypothesis (H1): mean(weekend) != mean(weekday)

t_stat, p_value = stats.ttest_ind(weekend_bikes, weekday_bikes, equal_var=False)  # t-test

# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Mean weekend rentals: {weekend_bikes.mean():.2f}")
print(f"Mean weekday rentals: {weekday_bikes.mean():.2f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nReject the null hypothesis (p < {alpha}).")
    print("There is a statistically significant difference in bike rentals between weekends and weekdays.")
else:
    print(f"\nFail to reject the null hypothesis (p ≥ {alpha}).")
    print("No statistically significant difference found between weekend and weekday rentals.")
T-test for difference between weekend and weekday bike rentals:
T-statistic: -13.9286
P-value: 0.0000
Mean weekend rentals: 23219.72
Mean weekday rentals: 27571.74

Reject the null hypothesis (p < 0.05).
There is a statistically significant difference in bike rentals between weekends and weekdays.
In [220]:
#Function create to plot values between actuals and predicted

def plot_actual_vs_predicted(model, model_name, dataframe):
    """
    Plots actual data vs. model predictions with specific custom styling.

    Args:
        model (statsmodels.regression.linear_model.RegressionResultsWrapper):
            A fitted statsmodels model object.
        model_name (str):
            A descriptive name for the model to be used in the plot title.
        dataframe (pd.DataFrame):
            The original dataframe containing the 'date' and 'bikes_hired' columns.
    """
    print(f"\n--- Generating plot for {model_name} ---")

    # Create a copy of the dataframe and add predictions to ensure alignment
    df_plot = dataframe.copy()
    df_plot['predicted_hires'] = model.predict(dataframe)
    df_plot['date'] = pd.to_datetime(df_plot['date'])

    # Create figure with white background (like ggplot's theme_bw())
    plt.figure(figsize=(14, 7), facecolor='white')
    
    # Set white background for the plot area
    ax = plt.gca()
    ax.set_facecolor('white')
    
    # Add grid lines (like theme_bw())
    ax.grid(True, color='lightgray', linestyle='-', linewidth=0.5, alpha=0.7)
    ax.set_axisbelow(True)  # Put grid behind the data points

    # Plot the actual and predicted data
    sns.scatterplot(
        data=df_plot, x='date', y='bikes_hired',
        color='black', alpha=0.3, label='Actual Hires'
    )
    sns.scatterplot(
        data=df_plot, x='date', y='predicted_hires',
        color='blue', alpha=0.7, label='Predicted Hires'
    )

    # Remove padding from the x-axis so it starts and ends on the actual data range
    ax.set_xlim(df_plot['date'].min(), df_plot['date'].max())

    # Set custom date ticks to show every 6 months (Jan and Jul)
    ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=[1, 7]))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%y'))

    # Improve label readability by rotating them
    plt.xticks(rotation=45, ha='right')

    # Set black borders around the plot (like theme_bw())
    for spine in ax.spines.values():
        spine.set_edgecolor('black')
        spine.set_linewidth(1)

    # Finalize the plot
    plt.title(f'Actual vs. Predicted Bike Hires - {model_name}')
    plt.xlabel('')
    plt.ylabel('Number of Bikes Hired')
    plt.legend()
    
    # Adjust layout to prevent label cutoff
    plt.tight_layout()
    
    plt.show()
In [221]:
# =============================================================================
# Model 0: using the mean to predict bikes_hired
# =============================================================================

# We start with the naive model where we just use the average to predict how many 
# bikes we are going to rent out on a single day

print("Summary statistics for bike_filtered:")
print(bike_filtered['bikes_hired'].describe()) 

# Create mean model using statsmodels
# --- Model 0: Intercept-only (Naive Mean Model) ---
print("\n--- Model 0: Intercept Only ---")
# Statsmodels Version
model0_sm = smf.ols('bikes_hired ~ 1', data=bike_filtered).fit()
print("Statsmodels - Model 0 Summary:")
print(model0_sm.summary().tables[1])
print(f"Residual SE: {model0_sm.scale ** 0.5:.3f}")  # Residual standard error is the square root of the scale (MSE of residuals).

# Plot actual vs predicted for Model 0
#plt(model0_sm, "Model 0: Mean Model", bike_filtered)

# Scikit-learn Version of the same model 
model0 = LinearRegression()
y = bike_filtered['bikes_hired']
y_pred_null = np.full_like(y, y.mean()) # Predict the mean for all observations
rmse_null = np.sqrt(mean_squared_error(y, y_pred_null))
print("\nScikit-learn - Model 0 (Null Model):")
print(f"  Prediction (Mean): {y.mean():.2f}")
print(f"  Root Mean Squared Error: {rmse_null:.2f}")
print(f"Residual SE: {model0_sm.scale ** 0.5:.3f}")  # Residual standard error is the square root of the scale (MSE of residuals).
Summary statistics for bike_filtered:
count     4017.000000
mean     27833.153099
std       9357.364234
min          0.000000
25%      21718.000000
50%      27676.000000
75%      34372.000000
max      73094.000000
Name: bikes_hired, dtype: float64

--- Model 0: Intercept Only ---
Statsmodels - Model 0 Summary:
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   2.783e+04    147.640    188.521      0.000    2.75e+04    2.81e+04
==============================================================================
Residual SE: 9357.364

Scikit-learn - Model 0 (Null Model):
  Prediction (Mean): 27833.15
  Root Mean Squared Error: 9356.20
Residual SE: 9357.364
In [222]:
# Split the data into train and test sets (80% train, 20% test)
train, test = train_test_split(bike_filtered, test_size=0.2, random_state=42)

# Define your model formula (the best OLS)
formula = 'bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2'

# Fit the model on the training set
model_train = smf.ols(formula=formula, data=train).fit()
In [223]:
print("\n--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")
# Statsmodels Version
precip2 = bike_filtered["precip"] ** 2
month2 = bike_filtered["month"] ** 2
model1_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip  + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
print("Statsmodels - Model 1 Summary:")
print(model1_sm.summary().tables[1])
print("R-squared:", model1_sm.rsquared.round(4))
print("Adjusted R-squared:", model1_sm.rsquared_adj.round(4))
print(f"Residual SE: {model1_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).

# Plot actual vs predicted for Model 1
plot_actual_vs_predicted(model1_sm, "Model 1: Temperature Model", bike_filtered)

# Questions:
# - Is the effect of temp significant? Why?
# - What proportion of the overall variability in bike rentals does temperature explain?

print(f"R-squared: {model1_sm.rsquared:.4f}")
print(f"Adjusted R-squared: {model1_sm.rsquared_adj:.4f}")
--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---
Statsmodels - Model 1 Summary:
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                     1.544e+04    661.419     23.350      0.000    1.41e+04    1.67e+04
C(COVID)[T.1]                 1567.9292    232.103      6.755      0.000    1112.879    2022.980
C(icon)[T.cloudy]             -904.1495    804.736     -1.124      0.261   -2481.881     673.582
C(icon)[T.partly-cloudy-day]   865.9369    485.767      1.783      0.075     -86.438    1818.312
C(icon)[T.rain]               -157.7997    479.952     -0.329      0.742   -1098.773     783.174
C(icon)[T.snow]              -1324.5228   1139.197     -1.163      0.245   -3557.983     908.937
C(weekend)[T.True]           -1529.6283    197.488     -7.745      0.000   -1916.814   -1142.443
C(day_of_week)[T.Tue]         2210.7634    342.044      6.463      0.000    1540.167    2881.360
C(day_of_week)[T.Wed]         2504.8051    341.956      7.325      0.000    1834.381    3175.230
C(day_of_week)[T.Thu]         2520.0487    341.932      7.370      0.000    1849.671    3190.427
C(day_of_week)[T.Fri]         1005.5123    341.854      2.941      0.003     335.288    1675.736
C(day_of_week)[T.Sat]          382.7882    197.441      1.939      0.053      -4.306     769.882
C(day_of_week)[T.Sun]        -1912.4165    197.472     -9.684      0.000   -2299.572   -1525.261
tempmax                        849.3996     16.181     52.493      0.000     817.676     881.124
windspeed                     -209.1950     11.810    -17.713      0.000    -232.349    -186.041
precip                        -489.5830     30.772    -15.910      0.000    -549.913    -429.253
visibility                     170.6835      9.811     17.397      0.000     151.448     189.919
precip2                          2.8301      0.262     10.807      0.000       2.317       3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322

--- Generating plot for Model 1: Temperature Model ---
No description has been provided for this image
R-squared: 0.6189
Adjusted R-squared: 0.6174
In [224]:
print("\n--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")

print("\n-Train Dataset")
# Statsmodels Version
precip2 = bike_filtered["precip"] ** 2
month2 = bike_filtered["month"] ** 2
model_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
print("Statsmodels - Model Summary:")
print(model_sm.summary().tables[1])


print("R-squared:", model_sm.rsquared.round(4))
print("Adjusted R-squared:", model_sm.rsquared_adj.round(4))
print(f"Residual SE: {model_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).

print("\n-Test Dataset")

# Fit the model on the training set
model_sm = smf.ols(formula=formula, data=train).fit()

# Predictions on training set
y_train_pred = model_sm.predict(train)
rmse_train = np.sqrt(np.mean((train['bikes_hired'] - y_train_pred) ** 2))

# Predictions on test set
y_test_pred = model_sm.predict(test)
rmse_test = np.sqrt(np.mean((test['bikes_hired'] - y_test_pred) ** 2))


print(model_train.summary2())

print(f"Training RMSE: {rmse_train:.2f}")
print(f"Test RMSE: {rmse_test:.2f}")
--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---

-Train Dataset
Statsmodels - Model Summary:
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                     1.544e+04    661.419     23.350      0.000    1.41e+04    1.67e+04
C(COVID)[T.1]                 1567.9292    232.103      6.755      0.000    1112.879    2022.980
C(icon)[T.cloudy]             -904.1495    804.736     -1.124      0.261   -2481.881     673.582
C(icon)[T.partly-cloudy-day]   865.9369    485.767      1.783      0.075     -86.438    1818.312
C(icon)[T.rain]               -157.7997    479.952     -0.329      0.742   -1098.773     783.174
C(icon)[T.snow]              -1324.5228   1139.197     -1.163      0.245   -3557.983     908.937
C(weekend)[T.True]           -1529.6283    197.488     -7.745      0.000   -1916.814   -1142.443
C(day_of_week)[T.Tue]         2210.7634    342.044      6.463      0.000    1540.167    2881.360
C(day_of_week)[T.Wed]         2504.8051    341.956      7.325      0.000    1834.381    3175.230
C(day_of_week)[T.Thu]         2520.0487    341.932      7.370      0.000    1849.671    3190.427
C(day_of_week)[T.Fri]         1005.5123    341.854      2.941      0.003     335.288    1675.736
C(day_of_week)[T.Sat]          382.7882    197.441      1.939      0.053      -4.306     769.882
C(day_of_week)[T.Sun]        -1912.4165    197.472     -9.684      0.000   -2299.572   -1525.261
tempmax                        849.3996     16.181     52.493      0.000     817.676     881.124
windspeed                     -209.1950     11.810    -17.713      0.000    -232.349    -186.041
precip                        -489.5830     30.772    -15.910      0.000    -549.913    -429.253
visibility                     170.6835      9.811     17.397      0.000     151.448     189.919
precip2                          2.8301      0.262     10.807      0.000       2.317       3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322

-Test Dataset
                            Results: Ordinary least squares
=======================================================================================
Model:                      OLS                    Adj. R-squared:           0.620     
Dependent Variable:         bikes_hired            AIC:                      64757.4969
Date:                       2025-09-10 23:23       BIC:                      64860.7712
No. Observations:           3213                   Log-Likelihood:           -32362.   
Df Model:                   16                     F-statistic:              328.8     
Df Residuals:               3196                   Prob (F-statistic):       0.00      
R-squared:                  0.622                  Scale:                    3.2989e+07
---------------------------------------------------------------------------------------
                               Coef.     Std.Err.    t     P>|t|    [0.025     0.975]  
---------------------------------------------------------------------------------------
Intercept                    15194.2729  730.8150  20.7909 0.0000 13761.3592 16627.1866
C(COVID)[T.1]                 1720.4153  256.0561   6.7189 0.0000  1218.3645  2222.4661
C(icon)[T.cloudy]              -93.6921  903.0693  -0.1037 0.9174 -1864.3460  1676.9618
C(icon)[T.partly-cloudy-day]  1095.2609  526.4740   2.0804 0.0376    63.0000  2127.5218
C(icon)[T.rain]                 91.5252  521.4307   0.1755 0.8607  -930.8474  1113.8977
C(icon)[T.snow]               -620.5594 1282.3316  -0.4839 0.6285 -3134.8353  1893.7165
C(weekend)[T.True]           -1611.0289  222.5938  -7.2375 0.0000 -2047.4700 -1174.5877
C(day_of_week)[T.Tue]         2228.8975  379.0936   5.8795 0.0000  1485.6061  2972.1888
C(day_of_week)[T.Wed]         2273.1796  379.9990   5.9821 0.0000  1528.1131  3018.2461
C(day_of_week)[T.Thu]         2575.8420  383.6790   6.7135 0.0000  1823.5600  3328.1240
C(day_of_week)[T.Fri]          949.3550  382.2902   2.4833 0.0131   199.7961  1698.9139
C(day_of_week)[T.Sat]          381.6113  220.8522   1.7279 0.0841   -51.4150   814.6376
C(day_of_week)[T.Sun]        -1992.6402  222.9681  -8.9369 0.0000 -2429.8152 -1555.4651
tempmax                        850.4917   17.9804  47.3010 0.0000   815.2374   885.7460
windspeed                     -208.5108   13.1943 -15.8031 0.0000  -234.3809  -182.6407
precip                        -482.2870   33.7555 -14.2877 0.0000  -548.4716  -416.1024
visibility                     168.7889   10.7874  15.6469 0.0000   147.6380   189.9398
precip2                          2.7572    0.2759   9.9925 0.0000     2.2162     3.2982
---------------------------------------------------------------------------------------
Omnibus:                 202.941          Durbin-Watson:             1.941             
Prob(Omnibus):           0.000            Jarque-Bera (JB):          855.464           
Skew:                    -0.128           Prob(JB):                  0.000             
Kurtosis:                5.515            Condition No.:             404652720913748480
=======================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The smallest eigenvalue is 8.52e-27. This might indicate that                there
are strong multicollinearity problems or that the design                matrix is
singular.
Training RMSE: 5728.39
Test RMSE: 5973.22
In [225]:
# After fitting your model (model1_sm)
# Get the design matrix X from the fitted model
X = model1_sm.model.exog
features = model1_sm.model.exog_names

# Drop Intercept if present
if "Intercept" in features:
    idx = features.index("Intercept")
    X = X[:, [i for i in range(X.shape[1]) if i != idx]]
    features = [f for f in features if f != "Intercept"]

# Compute VIF
vif_df = pd.DataFrame({
    "feature": features,
    "VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])]
})

print("\n--- Variance Inflation Factor (VIF) ---")
print(vif_df.sort_values(by="VIF", ascending=False).round(3))
--- Variance Inflation Factor (VIF) ---
                         feature    VIF
5             C(weekend)[T.True]    inf
11         C(day_of_week)[T.Sun]    inf
10         C(day_of_week)[T.Sat]    inf
13                     windspeed  9.059
3                C(icon)[T.rain]  8.695
15                    visibility  7.780
12                       tempmax  6.181
2   C(icon)[T.partly-cloudy-day]  3.625
14                        precip  3.332
16                       precip2  2.862
8          C(day_of_week)[T.Thu]  1.884
7          C(day_of_week)[T.Wed]  1.879
9          C(day_of_week)[T.Fri]  1.864
6          C(day_of_week)[T.Tue]  1.861
0                  C(COVID)[T.1]  1.641
1              C(icon)[T.cloudy]  1.162
4                C(icon)[T.snow]  1.111
In [226]:
# Run Breusch-Pagan test
bp_test = het_breuschpagan(model1_sm.resid, model1_sm.model.exog)

# Unpack results
bp_stat, bp_pvalue, f_stat, f_pvalue = bp_test

print("Breusch-Pagan test results:")
print(f"  Lagrange Multiplier statistic: {bp_stat:.4f}")
print(f"  LM test p-value: {bp_pvalue:.4f}")
print(f"  F-statistic: {f_stat:.4f}")
print(f"  F-test p-value: {f_pvalue:.4f}")

# Interpretation
alpha = 0.05
if bp_pvalue < alpha:
    print("\nReject the null hypothesis: Evidence of heteroskedasticity.")
else:
    print("\nFail to reject the null hypothesis: No evidence of heteroskedasticity.")
Breusch-Pagan test results:
  Lagrange Multiplier statistic: 325.0416
  LM test p-value: 0.0000
  F-statistic: 22.0101
  F-test p-value: 0.0000

Reject the null hypothesis: Evidence of heteroskedasticity.
In [227]:
# Split the data into train and test sets (80% train, 20% test)
train, test = train_test_split(bike_filtered, test_size=0.2, random_state=42)

# Defining model formula (Model 1)
specification = 'bikes_hired ~ tempmax + daylight_time + C(icon) + C(weekend) + C(season_name) + humidity + precipprob'

# Model 2
formula = 'bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2'


# Fit the model on the training set
model_train = smf.ols(formula=formula, data=train).fit()
In [228]:
# Model 1 - Linear Fit

model = smf.ols(specification, data=train).fit()
print(model.summary())

y_pred = model.predict(test) # fit to test data
rmse = np.sqrt(mean_squared_error(test['bikes_hired'], y_pred)) # generate rmse metric
r2 = r2_score(test['bikes_hired'], y_pred) # generate r squared metric

# print
print(f"Test RMSE: {rmse:.2f}")
print(f"Test R²: {r2:.2f}")

# VIF code
y, X = dmatrices(specification, data=train, return_type="dataframe") # build design matrices (y = target, X = predictors including dummies & interactions)

# Compute VIF for each column in X to check
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print("\nVariance Inflation Factors:")
print(vif_data)

# create plot for prediction vs actual

# ensure set to date time
test['date'] = pd.to_datetime(test['date'])

# create copy to avoid disrupting original dataframe
test_sorted = test.copy()

# amend predictions as column
test_sorted['y_pred'] = y_pred

# sort by date for neat look in graph
test_sorted = test_sorted.sort_values('date')

# plotting out the graph
plt.figure(figsize=(12,6))
plt.plot(test_sorted['date'], test_sorted['bikes_hired'], label='Actual', color='blue') # plot actual
plt.plot(test_sorted['date'], test_sorted['y_pred'], label='Predicted', color='red', linestyle='--') # plot predictions
plt.title("Predicted vs Actual Bikes Hired (Test Sample, Model 1)", fontsize=14) # title
plt.xlabel("Date")
plt.ylabel("Bikes Hired")
plt.legend()
plt.show()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            bikes_hired   R-squared:                       0.548
Model:                            OLS   Adj. R-squared:                  0.547
Method:                 Least Squares   F-statistic:                     353.1
Date:                Wed, 10 Sep 2025   Prob (F-statistic):               0.00
Time:                        23:23:23   Log-Likelihood:                -32649.
No. Observations:                3213   AIC:                         6.532e+04
Df Residuals:                    3201   BIC:                         6.539e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                     2.034e+04   1611.190     12.627      0.000    1.72e+04    2.35e+04
C(icon)[T.cloudy]             -868.7907    997.875     -0.871      0.384   -2825.330    1087.748
C(icon)[T.partly-cloudy-day]   263.2170    583.404      0.451      0.652    -880.666    1407.100
C(icon)[T.rain]               1046.4496    629.399      1.663      0.096    -187.616    2280.515
C(icon)[T.snow]              -1046.7858    629.463     -1.663      0.096   -2280.978     187.406
C(weekend)[T.True]           -4016.6069    246.860    -16.271      0.000   -4500.627   -3532.587
C(season_name)[T.Spring]     -1064.5449    537.028     -1.982      0.048   -2117.498     -11.592
C(season_name)[T.Summer]       782.6292    672.384      1.164      0.245    -535.717    2100.975
C(season_name)[T.Autumn]      3817.3756    383.640      9.950      0.000    3065.171    4569.580
tempmax                        433.2920     33.653     12.875      0.000     367.308     499.276
daylight_time                    0.0026      0.000      9.800      0.000       0.002       0.003
humidity                      -111.6164     15.008     -7.437      0.000    -141.042     -82.191
precipprob                     -33.6238      8.783     -3.828      0.000     -50.846     -16.402
==============================================================================
Omnibus:                      172.279   Durbin-Watson:                   1.935
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              578.010
Skew:                           0.175   Prob(JB):                    3.07e-126
Kurtosis:                       5.048   Cond. No.                     1.10e+21
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.51e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Test RMSE: 6433.71
Test R²: 0.54

Variance Inflation Factors:
                         feature         VIF
0                      Intercept  211.829516
1              C(icon)[T.cloudy]    1.537673
2   C(icon)[T.partly-cloudy-day]    4.918740
3                C(icon)[T.rain]         inf
4                C(icon)[T.snow]         inf
5             C(weekend)[T.True]    1.001384
6       C(season_name)[T.Spring]    4.396925
7       C(season_name)[T.Summer]    7.001070
8       C(season_name)[T.Autumn]    2.245775
9                        tempmax    3.699342
10                 daylight_time    6.928360
11                      humidity    2.045534
12                    precipprob         inf
No description has been provided for this image
In [229]:
# Model 2 - OLS along with polynomial relationship
print("\n--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")

# Redefining precip 2 to be safe
precip2 = bike_filtered["precip"] ** 2

# Statsmodels Version
model2_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()

# Printing out results data
print("Statsmodels - Model 2 Summary:")
print(model2_sm.summary().tables[1])
print("R-squared:", model2_sm.rsquared.round(4))
print("Adjusted R-squared:", model2_sm.rsquared_adj.round(4))
print(f"Residual SE: {model2_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).

# Plot actual vs predicted for Model 1
plot_actual_vs_predicted(model2_sm, "Model 2: Non-linear Model", bike_filtered)
--- Model 2: bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---
Statsmodels - Model 2 Summary:
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                     1.544e+04    661.419     23.350      0.000    1.41e+04    1.67e+04
C(COVID)[T.1]                 1567.9292    232.103      6.755      0.000    1112.879    2022.980
C(icon)[T.cloudy]             -904.1495    804.736     -1.124      0.261   -2481.881     673.582
C(icon)[T.partly-cloudy-day]   865.9369    485.767      1.783      0.075     -86.438    1818.312
C(icon)[T.rain]               -157.7997    479.952     -0.329      0.742   -1098.773     783.174
C(icon)[T.snow]              -1324.5228   1139.197     -1.163      0.245   -3557.983     908.937
C(weekend)[T.True]           -1529.6283    197.488     -7.745      0.000   -1916.814   -1142.443
C(day_of_week)[T.Tue]         2210.7634    342.044      6.463      0.000    1540.167    2881.360
C(day_of_week)[T.Wed]         2504.8051    341.956      7.325      0.000    1834.381    3175.230
C(day_of_week)[T.Thu]         2520.0487    341.932      7.370      0.000    1849.671    3190.427
C(day_of_week)[T.Fri]         1005.5123    341.854      2.941      0.003     335.288    1675.736
C(day_of_week)[T.Sat]          382.7882    197.441      1.939      0.053      -4.306     769.882
C(day_of_week)[T.Sun]        -1912.4165    197.472     -9.684      0.000   -2299.572   -1525.261
tempmax                        849.3996     16.181     52.493      0.000     817.676     881.124
windspeed                     -209.1950     11.810    -17.713      0.000    -232.349    -186.041
precip                        -489.5830     30.772    -15.910      0.000    -549.913    -429.253
visibility                     170.6835      9.811     17.397      0.000     151.448     189.919
precip2                          2.8301      0.262     10.807      0.000       2.317       3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322

--- Generating plot for Model 2: Non-linear Model ---
No description has been provided for this image
In [230]:
# Model 2 Out-of-sample Testing

print("\n--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---")

# Training dataset
print("\n-Train Dataset")

# Statsmodels Version
model_sm = smf.ols('bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ', data=bike_filtered).fit()
print("Statsmodels - Model Summary:")
print(model_sm.summary().tables[1])

# Printing results data
print("R-squared:", model_sm.rsquared.round(4))
print("Adjusted R-squared:", model_sm.rsquared_adj.round(4))
print(f"Residual SE: {model_sm.scale ** 0.5:.3f}") # Residual standard error is the square root of the scale (MSE of residuals).


# Testing dataset
print("\n-Test Dataset")

# Fit the model on the training set
model_sm = smf.ols(formula=formula, data=train).fit()

# Predictions on training set
y_train_pred = model_sm.predict(train)
rmse_train = np.sqrt(np.mean((train['bikes_hired'] - y_train_pred) ** 2))

# Predictions on test set
y_test_pred = model_sm.predict(test)
rmse_test = np.sqrt(np.mean((test['bikes_hired'] - y_test_pred) ** 2))

# Printing results data
print(model_train.summary2())
print(f"Training RMSE: {rmse_train:.2f}")
print(f"Test RMSE: {rmse_test:.2f}")
--- Model 2 - Out-of-sample-testing : bikes_hired ~ tempmax + windspeed + C(COVID) + C(icon) + C(weekend) + precip + visibility + C(day_of_week) + precip2 ---

-Train Dataset
Statsmodels - Model Summary:
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                     1.544e+04    661.419     23.350      0.000    1.41e+04    1.67e+04
C(COVID)[T.1]                 1567.9292    232.103      6.755      0.000    1112.879    2022.980
C(icon)[T.cloudy]             -904.1495    804.736     -1.124      0.261   -2481.881     673.582
C(icon)[T.partly-cloudy-day]   865.9369    485.767      1.783      0.075     -86.438    1818.312
C(icon)[T.rain]               -157.7997    479.952     -0.329      0.742   -1098.773     783.174
C(icon)[T.snow]              -1324.5228   1139.197     -1.163      0.245   -3557.983     908.937
C(weekend)[T.True]           -1529.6283    197.488     -7.745      0.000   -1916.814   -1142.443
C(day_of_week)[T.Tue]         2210.7634    342.044      6.463      0.000    1540.167    2881.360
C(day_of_week)[T.Wed]         2504.8051    341.956      7.325      0.000    1834.381    3175.230
C(day_of_week)[T.Thu]         2520.0487    341.932      7.370      0.000    1849.671    3190.427
C(day_of_week)[T.Fri]         1005.5123    341.854      2.941      0.003     335.288    1675.736
C(day_of_week)[T.Sat]          382.7882    197.441      1.939      0.053      -4.306     769.882
C(day_of_week)[T.Sun]        -1912.4165    197.472     -9.684      0.000   -2299.572   -1525.261
tempmax                        849.3996     16.181     52.493      0.000     817.676     881.124
windspeed                     -209.1950     11.810    -17.713      0.000    -232.349    -186.041
precip                        -489.5830     30.772    -15.910      0.000    -549.913    -429.253
visibility                     170.6835      9.811     17.397      0.000     151.448     189.919
precip2                          2.8301      0.262     10.807      0.000       2.317       3.344
================================================================================================
R-squared: 0.6189
Adjusted R-squared: 0.6174
Residual SE: 5788.322

-Test Dataset
                            Results: Ordinary least squares
=======================================================================================
Model:                      OLS                    Adj. R-squared:           0.620     
Dependent Variable:         bikes_hired            AIC:                      64757.4969
Date:                       2025-09-10 23:23       BIC:                      64860.7712
No. Observations:           3213                   Log-Likelihood:           -32362.   
Df Model:                   16                     F-statistic:              328.8     
Df Residuals:               3196                   Prob (F-statistic):       0.00      
R-squared:                  0.622                  Scale:                    3.2989e+07
---------------------------------------------------------------------------------------
                               Coef.     Std.Err.    t     P>|t|    [0.025     0.975]  
---------------------------------------------------------------------------------------
Intercept                    15194.2729  730.8150  20.7909 0.0000 13761.3592 16627.1866
C(COVID)[T.1]                 1720.4153  256.0561   6.7189 0.0000  1218.3645  2222.4661
C(icon)[T.cloudy]              -93.6921  903.0693  -0.1037 0.9174 -1864.3460  1676.9618
C(icon)[T.partly-cloudy-day]  1095.2609  526.4740   2.0804 0.0376    63.0000  2127.5218
C(icon)[T.rain]                 91.5252  521.4307   0.1755 0.8607  -930.8474  1113.8977
C(icon)[T.snow]               -620.5594 1282.3316  -0.4839 0.6285 -3134.8353  1893.7165
C(weekend)[T.True]           -1611.0289  222.5938  -7.2375 0.0000 -2047.4700 -1174.5877
C(day_of_week)[T.Tue]         2228.8975  379.0936   5.8795 0.0000  1485.6061  2972.1888
C(day_of_week)[T.Wed]         2273.1796  379.9990   5.9821 0.0000  1528.1131  3018.2461
C(day_of_week)[T.Thu]         2575.8420  383.6790   6.7135 0.0000  1823.5600  3328.1240
C(day_of_week)[T.Fri]          949.3550  382.2902   2.4833 0.0131   199.7961  1698.9139
C(day_of_week)[T.Sat]          381.6113  220.8522   1.7279 0.0841   -51.4150   814.6376
C(day_of_week)[T.Sun]        -1992.6402  222.9681  -8.9369 0.0000 -2429.8152 -1555.4651
tempmax                        850.4917   17.9804  47.3010 0.0000   815.2374   885.7460
windspeed                     -208.5108   13.1943 -15.8031 0.0000  -234.3809  -182.6407
precip                        -482.2870   33.7555 -14.2877 0.0000  -548.4716  -416.1024
visibility                     168.7889   10.7874  15.6469 0.0000   147.6380   189.9398
precip2                          2.7572    0.2759   9.9925 0.0000     2.2162     3.2982
---------------------------------------------------------------------------------------
Omnibus:                 202.941          Durbin-Watson:             1.941             
Prob(Omnibus):           0.000            Jarque-Bera (JB):          855.464           
Skew:                    -0.128           Prob(JB):                  0.000             
Kurtosis:                5.515            Condition No.:             404652720913748480
=======================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly
specified.
[2] The smallest eigenvalue is 8.52e-27. This might indicate that                there
are strong multicollinearity problems or that the design                matrix is
singular.
Training RMSE: 5728.39
Test RMSE: 5973.22
In [231]:
# VIF Tests (Model 2)

# We carried out VIF testing using the statsmodel library, rightaway from the fitted models
X = model1_sm.model.exog

# Defining features as the variables we used to predict in our model
features = model1_sm.model.exog_names


# Compute VIF stats
vif_df = pd.DataFrame({
    "feature": features,
    "VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])]
})

# Printing out results
print("\n--- Variance Inflation Factor (VIF) ---")
print(vif_df.sort_values(by="VIF", ascending=False).round(3))
--- Variance Inflation Factor (VIF) ---
                         feature     VIF
11         C(day_of_week)[T.Sat]     inf
6             C(weekend)[T.True]     inf
12         C(day_of_week)[T.Sun]     inf
0                      Intercept  52.451
4                C(icon)[T.rain]   5.851
3   C(icon)[T.partly-cloudy-day]   5.060
15                        precip   3.017
17                       precip2   2.857
8          C(day_of_week)[T.Wed]   1.717
9          C(day_of_week)[T.Thu]   1.717
10         C(day_of_week)[T.Fri]   1.716
7          C(day_of_week)[T.Tue]   1.715
2              C(icon)[T.cloudy]   1.534
16                    visibility   1.410
5                C(icon)[T.snow]   1.306
13                       tempmax   1.241
1                  C(COVID)[T.1]   1.224
14                     windspeed   1.185
In [232]:
# Model 3: XGBoost with the same specification

target = "bikes_hired" # define target
features = ["tempmax", "daylight_time", "icon", "weekend", "season_name", "humidity", "precipprob","year"] # define features

X = bike_filtered[features].copy() # duplicate dataset
y = bike_filtered[target] # rename target by convention

# xgboost cannot handle categorical variables, so must one hot encode
X = pd.get_dummies(X, columns=["icon", "weekend", "season_name"], drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123) # train test split

# Fit model
model = xgb.XGBRegressor( 
    n_estimators=500, # number of trees
    learning_rate=0.05, # within regular range
    max_depth=6,  # maximum 6 layers of trees
    subsample=0.8, # each tree trained on 80% of data
    colsample_bytree=0.8, # tree sees 80% of features
    random_state=123, # for reproducibility
    gamma=1.0, # minimum loss reduction for a split (prevent overfitting)
    reg_lambda=1.0, # L2 regularization (prevent overfitting)
    reg_alpha=0.1, # L1 regularization (prevent overfitting)
)

# Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test) # generate predictions as before

# Recompute metrics as before
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.2f}")
print(f"Test R²: {r2:.2f}")

# Replicate code for graphing
test_plot = test.copy()
test_plot["y_pred"] = model.predict(X.loc[test.index])  # align predictions with test set rows
test_plot["date"] = pd.to_datetime(test_plot["date"])
test_plot = test_plot.sort_values("date")

plt.figure(figsize=(12,6))
plt.plot(test_plot["date"], test_plot["bikes_hired"], label="Actual", color="blue")
plt.plot(test_plot["date"], test_plot["y_pred"], label="Predicted", color="red", linestyle="--")
plt.title("Predicted vs Actual Bikes Hired (Model 3)", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Bikes Hired")
plt.legend()
plt.show()

# Most important features plot
xgb.plot_importance(model, max_num_features=10)
plt.title("Top 10 Important Features (XGBoost)")
plt.show()

# Partial dependence plot
features_to_plot = ["tempmax","weekend_True","humidity","year"] # selected most important features
fig, ax = plt.subplots(figsize=(10, 8))
PartialDependenceDisplay.from_estimator(model, X_train, features_to_plot, grid_resolution=50, ax=ax)
fig.suptitle("Partial Dependence Plots for Most Important Features", fontsize=16, y=1.02)
plt.show()
Test RMSE: 4941.74
Test R²: 0.75
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Conclusions/Recommendations:¶

Stock more bikes in the Autumn to increase availability during peak season (as bike hires increase in Autumn / all months have negative coefficients compared to Autumn)​¶
Schedule maintenance on weekends when bike demand is lower (weekend coefficient < weekday).​¶
Create a marketing campaign that incentivizes people to use more bikes in the winter (high decrease in demand during the winter according to model)​¶
Staff more employees on peak usage days (Wednesday, Thursday) to manage returns and ensure there are enough bikes for hire (e.g., docks in the city should be empty in the morning for parking and stocked in the afternoon for people leaving)​¶
These recommendations can be strengthened with the addition of pricing data for each day, allowing for elasticity calculations¶